Prediction of hypertension using traditional regression and machine learning models: A systematic review and meta-analysis

Objective
We aimed to identify existing hypertension risk prediction models developed using traditional regression-based or machine learning approaches and compare their predictive performance.

Methods
We systematically searched MEDLINE, EMBASE, Web of Science, Scopus, and the grey literature for studies predicting the risk of hypertension among the general adult population. The C-statistic was extracted from each study as the summary measure of performance, and a random-effects meta-analysis was used to obtain pooled estimates. The pooled predictive performance was compared between traditional regression-based models and machine learning-based models. The potential sources of heterogeneity were assessed using meta-regression, and study quality was assessed using the PROBAST (Prediction model Risk Of Bias ASsessment Tool) checklist.

Results
Of 14,778 articles, 52 articles were selected for systematic review and 32 for meta-analysis. The overall pooled C-statistic was 0.75 [0.73–0.77] for the traditional regression-based models and 0.76 [0.72–0.79] for the machine learning-based models. High heterogeneity in the C-statistic was observed. The age (p = 0.011) and sex (p = 0.044) of the participants and the number of risk factors considered in the model (p = 0.001) were identified as sources of heterogeneity in traditional regression-based models.

Conclusion
We attempted to provide a comprehensive evaluation of hypertension risk prediction models. Many models with acceptable-to-good predictive performance were identified. Only a few models were externally validated, and risk of bias and applicability were concerns in many studies. Overall discrimination was similar between models derived from traditional regression analysis and machine learning methods. More external validation and impact studies are required to implement hypertension risk prediction models in clinical practice.


Introduction
Hypertension is a common medical condition affecting about 1 in 4 people [1] and is a significant risk factor for heart attack, stroke, kidney disease, and mortality [2]. Hypertension has been linked to 13% of deaths globally [3] and is a significant health burden that affects all population segments. Considering the high prevalence and global burden, hypertension prevention and control strategies need to be a top priority. Hypertension can be prevented by applying strategies that target the general population or individuals and groups at higher risk for hypertension [4]. The need for early identification of at-risk individuals who could benefit from preventive interventions has led to a growing interest in hypertension risk prediction.
Predicting the risk of developing hypertension through modeling can help identify important risk factors contributing to hypertension, provide reasonable estimates about future hypertension risk [5], and help identify high-risk individuals who can be targeted for healthy behavioral changes and medical treatment to prevent hypertension [6][7][8]. Many prediction models have been developed over the years to predict the risk of hypertension in the general population. Models were developed using either a traditional regression-based approach or a modern machine learning approach. Although machine learning approaches are often credited with better predictive performance, their performance varies, and it is not clear whether they perform better than traditional regression-based models in predicting hypertension. Through a systematic review and subsequent meta-analysis, a pooled synthesis of the performance measures of different models produced in multiple studies can be compared and measured [9]. This methodology provides an overview of these models' predictive ability and allows the models' performance measures based on the reported data to be explored quantitatively [9]. Two prior studies systematically analyzed hypertension risk prediction models in adults [10,11]. Both studies performed a narrative synthesis of the evidence to summarize existing knowledge of hypertension prediction models, and one study also performed a meta-analysis without assessing heterogeneity. Neither of the prior studies stratified models according to how they were developed. This stratification is important because the two types of models differ inherently in computation, complexity, interpretability, and accuracy. A formal assessment of study quality was also absent in prior studies. In addition to these two prior reviews, a systematic review was also carried out on prediction models to classify children at an elevated risk of developing hypertension [12].
With this in mind, we aimed to 1) systematically review the literature to identify hypertension risk prediction models that have been applied to the general adult population and the risk factors that were considered in those models; 2) characterize the study populations in which these models were derived and validated; 3) compare the predictive performance of traditionally developed regression-based models and machine learning models; and 4) assess the quality of these prediction models to better inform the selection of models for clinical implementation.
independently. Lastly, articles containing extractable data on hypertension prediction models and hypertension risk factors were selected for data extraction. Inter-rater reliability (Kappa coefficient) was estimated to measure agreement between the independent reviewers. Any disagreement between reviewers was resolved through consensus.
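To make the agreement measure concrete, the following is a minimal sketch of Cohen's kappa for two reviewers, assuming that the Kappa coefficient mentioned above is Cohen's kappa for two raters. The function name and the screening decisions are hypothetical and are not data from this review.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten articles (illustration only).
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "include", "exclude"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in screening because most articles are excluded by both reviewers.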

Data extraction
Two reviewers (MC and IN) independently extracted data from each study using standardized forms. We classified the identified models into two categories: models developed using a traditional regression-based approach and models developed using machine learning algorithms. Separate data extraction sheets were used for each model type and included the study name, the location where the model was developed (or the location of the data used to develop it), participants' ethnicity, study design, sample size, age and gender of the study participants, risk factors included in the model, number of events and total participants, the outcome considered, the definition used for hypertension, duration of follow-up, modeling method used, measures of discrimination and calibration of the prediction model, and the validation of the prediction model. In a separate form, information about the externally validated hypertension risk prediction models was extracted, including the study name/model validated, the total number of validation studies, location of the validation study, follow-up period, number of events and total participants, the definition of the outcome, and the discrimination and calibration of the model. We also extracted information about risk factors, particularly how many times a specific risk factor was considered in the models. Each reviewer assessed study quality according to the Prediction model Risk Of Bias ASsessment Tool (PROBAST) checklist [14,15]. The PROBAST is designed to evaluate the risk of bias and concerns regarding the applicability of diagnostic and prognostic prediction model studies. It contains 20 questions under four domains (participants, predictors, outcome, and analysis) that facilitate judgment of risk of bias and applicability. The overall risk of bias of the prediction models was judged as "low", "high", or "unclear", and the overall applicability of the prediction models was judged as "low concern", "high concern", or "unclear" according to the PROBAST checklist [14,15].
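As an illustration of how domain-level ratings can be combined into an overall judgement, the sketch below encodes the aggregation rule as we understand it from the PROBAST guidance: "high" if any domain is high, otherwise "unclear" if any domain is unclear, otherwise "low". This rule is an assumption brought in for illustration and is not spelled out in the text above. The function name and the example ratings are hypothetical.

```python
VALID_RATINGS = {"low", "high", "unclear"}


def overall_judgement(domain_ratings):
    """Combine per-domain ratings into a single overall rating.

    Assumed rule: any 'high' domain gives 'high'; otherwise any
    'unclear' domain gives 'unclear'; otherwise 'low'.
    """
    ratings = set(domain_ratings.values())
    if not ratings:
        raise ValueError("at least one domain rating is required")
    if not ratings <= VALID_RATINGS:
        raise ValueError(f"unexpected rating(s): {ratings - VALID_RATINGS}")
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"


# Hypothetical risk-of-bias ratings for one study (illustration only).
study_rob = {
    "participants": "low",
    "predictors": "low",
    "outcome": "low",
    "analysis": "unclear",
}
print(overall_judgement(study_rob))
```

The same function can be applied to applicability ratings by passing only the domains rated for applicability.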

Data analysis
We summarized the number of studies identified and those included and excluded (with the reason for exclusion) from the systematic review and subsequent meta-analysis using the PRISMA flow diagram [16]. In data synthesis, we performed a meta-analysis on the performance measures of traditional regression-based prediction models (e.g., logistic regression and Cox proportional hazards regression models) and of models built with more complex strategies (e.g., machine learning tools). Discrimination and calibration are the two most common statistical measures of predictive performance. Discrimination is commonly quantified by the concordance (C) statistic. In this review, we performed a meta-analysis on the C-statistic or AUC (area under the receiver operating characteristic curve) to evaluate the models' predictive performance and provided a comprehensive summary of the models' predictive ability. We did not undertake a meta-analysis of calibration due to the unavailability of relevant data.
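For readers less familiar with the C-statistic, the sketch below computes it for a binary outcome as the proportion of case/non-case pairs in which the case received the higher predicted risk, with ties counted as one half. For a binary outcome this equals the AUC. The function name and the data are hypothetical; the included studies may have computed their C-statistics differently (e.g., with survival-time versions).

```python
def c_statistic(predicted_risk, outcome):
    """Concordance (C) statistic for a binary outcome (1 = event, 0 = no event)."""
    cases = [p for p, y in zip(predicted_risk, outcome) if y == 1]
    controls = [p for p, y in zip(predicted_risk, outcome) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                concordant += 1.0   # case ranked above non-case
            elif p_case == p_control:
                concordant += 0.5   # tie counts as half
    return concordant / (len(cases) * len(controls))


# Hypothetical predicted risks and observed hypertension status.
risks = [0.10, 0.35, 0.40, 0.80, 0.65, 0.20]
developed_htn = [0, 1, 0, 1, 1, 0]
c = c_statistic(risks, developed_htn)
print(f"C-statistic = {c:.3f}")
```

A value of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of those who did and did not develop the outcome.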
We logit-transformed the C-statistics before pooling, as recommended [17,18], and then back-transformed the results to the original scale for interpretation. We used a random-effects meta-analysis with REML estimation and the Hartung-Knapp-Sidik-Jonkman (HKSJ) confidence interval (CI) to obtain the pooled weighted average of the logit C-statistic [19]. Forest plots were generated to show the pooled C-statistic together with the 95% CI, the 95% approximate prediction interval for the summary C-statistic (which indicates the expected performance range of the considered models in a new population), the author's name, publication year, and study weights. For studies that provided a C-statistic (or AUC) but no measure of its variance or confidence interval, the standard error (SE) and 95% CI of the logit C-statistic were calculated using the appropriate formula [19]. When the CIs of the C-statistics were available, the SEs of the logit C-statistics were derived from the CIs [19]. The presence of heterogeneity (primarily due to differences in the study setting, participants, and methodology) was assessed using Cochran's Q statistic and quantified with the I² statistic. A p-value of less than 0.05 was taken to indicate statistically significant heterogeneity, which was categorized as low, moderate, or high when the I² value was below 25%, between 25% and 75%, or above 75%, respectively [20]. Sources of heterogeneity were further explored using meta-regression and stratified analyses according to modeling type and study characteristics (sex of the participants, age of the participants, number of risk factors considered in the model, sample size considered in the model, and ethnicity of the study participants). We calculated 95% prediction intervals to provide a likely range of performance of a prediction model in a new population and setting.
We did not assess publication bias by any statistical tests or funnel plot asymmetry. We used Stata version 16.1 (StataCorp LP, College Station, TX, USA) to perform statistical analysis using the following commands: meta, metan and metareg.
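The pooling steps described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the Stata analysis used in this review. In particular, it estimates the between-study variance with the DerSimonian-Laird method rather than REML (chosen only because it has a closed form), it uses the HKSJ standard error inside the prediction interval, and it reads t critical values from a small hard-coded table instead of a statistics library. The function names and the four input studies are hypothetical. The case of a C-statistic reported without a CI is not covered.

```python
import math

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom.
T975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def logit_se_from_ci(lower, upper, z=1.96):
    """SE of the logit C-statistic recovered from a reported 95% CI."""
    return (logit(upper) - logit(lower)) / (2 * z)


def pool_c_statistics(studies):
    """studies: list of (c_statistic, ci_lower, ci_upper) tuples."""
    k = len(studies)
    if k < 3 or (k - 1) not in T975:
        raise ValueError("this sketch supports 3 to 11 studies")
    y = [logit(c) for c, _, _ in studies]
    v = [logit_se_from_ci(lo, hi) ** 2 for _, lo, hi in studies]

    # Cochran's Q and I-squared from inverse-variance (fixed-effect) weights.
    w = [1 / vi for vi in v]
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    # Between-study variance: DerSimonian-Laird (assumption; the review used REML).
    scale = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / scale)

    # Random-effects pooled estimate on the logit scale.
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)

    # HKSJ standard error with a t-based confidence interval (k - 1 df).
    se_hk = math.sqrt(
        sum(wi * (yi - mu) ** 2 for wi, yi in zip(w_re, y)) / ((k - 1) * sum(w_re))
    )
    half_ci = T975[k - 1] * se_hk
    # Approximate prediction interval (k - 2 df).
    half_pi = T975[k - 2] * math.sqrt(tau2 + se_hk ** 2)

    # Back-transform everything to the C-statistic scale.
    return {
        "pooled_c": inv_logit(mu),
        "ci": (inv_logit(mu - half_ci), inv_logit(mu + half_ci)),
        "prediction_interval": (inv_logit(mu - half_pi), inv_logit(mu + half_pi)),
        "tau2": tau2,
        "i2_percent": i2,
    }


# Hypothetical C-statistics with 95% CIs (illustration only).
example_studies = [(0.72, 0.69, 0.75), (0.78, 0.76, 0.80),
                   (0.70, 0.66, 0.74), (0.81, 0.78, 0.84)]
result = pool_c_statistics(example_studies)
print(result)
```

Pooling on the logit scale keeps the back-transformed interval limits inside (0, 1), and the prediction interval is always at least as wide as the CI because it adds the between-study variance.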

Study identification and selection
We identified 14,730 articles through our electronic database search and an additional 48 articles through our grey literature search. After removing duplicates and screening titles, abstracts, and full texts, 52 articles were selected for the systematic review. Of these, 32 studies provided sufficient information for synthesis through a meta-analysis. The detailed study selection process is summarized in Fig 1. Agreement between reviewers on the initial screening and on the final articles eligible for inclusion in the systematic review was good (κ = 0.81 and κ = 0.89, respectively). A total of 117 models predicting the risk of hypertension in the general adult population were identified from the selected articles, of which 75 were developed using traditional regression-based modeling and 42 using machine learning tools.

Study characteristics of traditional regression-based models
Study characteristics of traditional regression-based models are presented in Tables 1 and 2. A total of 573,268 participants were used to develop 75 traditional models in 34 studies. Models were mainly developed in either white Caucasian or Asian populations. There was no model derived from African populations and only one [21] from Latin American populations. Two studies considered only male participants, one study considered only female participants, and the remaining studies considered both to develop the models. The number of risk factors considered to create the models ranged from 1 to 19, with a median of 7 risk factors per model. Age was the most common risk factor (considered in 61 models), followed by body mass index (BMI) (32 models), diastolic blood pressure (DBP) (28 models), systolic blood pressure (SBP) (27 models), and sex (21 models). The distribution of the conventional risk factors considered in the different models is presented in Fig 2A. The duration of follow-up (mean/median/total) considered to develop the models ranged from 1.6 to 30 years. The age of the study participants ranged from 15 to 90 years. SBP ≥ 140 mm Hg, DBP ≥ 90 mm Hg, or use of antihypertensive medication was the standard definition of hypertension in almost all the studies, except one study where SBP ≥ 130 mm Hg, DBP ≥ 80 mm Hg, or use of any antihypertensive drug was used. Logistic regression was the most used methodology to develop the model (15 studies), followed by Cox proportional-hazards regression (11 studies) and Weibull regression (6 studies). Calibration of the prediction model was not reported by most of the studies (19 studies). Studies that reported calibration measures (15 studies) mainly used the Hosmer-Lemeshow test. Discrimination was assessed using the C-statistic (or AUC) and reported by almost all studies, with values ranging from 0.57 to 0.97. Only one model was externally validated in the same study that developed it. Only eight models [22][23][24][25][26][27][28][29] were converted into a risk score after model development.

Meta-analysis of traditional regression-based models
The overall pooled C-statistic of the traditional regression-based models was 0.75 [0.73-0.77], with high heterogeneity in the discriminative performance of these models (I² = 99.3%, Cochran Q-statistic p < 0.001) (Fig 3). Pooled C-statistics stratified by modeling type are shown in Fig 3. Heterogeneity remained high within the different types of models (Fig 3). The 95% approximate prediction interval for the overall C-statistic was from 0.63 to 0.84. To explore possible sources of heterogeneity in the overall pooled C-statistic, we performed a meta-regression. We initially considered the following potential sources of heterogeneity: the definition of hypertension used (the cut-off level used to define hypertension), sex of the participants in included studies (categorized as female-only, male-only, and both male and female), age of the participants (study participants below average age versus above average age), number of risk factors considered in the model (below median versus above median), sample size considered in the model (below median versus above median), and ethnicity of the study participants (Whites versus Asians). However, we excluded the definition of hypertension as a heterogeneity source, as all studies except one used the same definition for hypertension. Meta-regression identified the participants' sex, that is, male-only compared to female-only (p = 0.044), participants' age (p = 0.011), and the number of risk factors considered in the model (p = 0.001) as potential sources of high heterogeneity in the C-statistic. Participants' sex when both male and female were compared to female-only (p = 0.351), sample size considered in the model (p = 0.395), and ethnicity of the study participants (p = 0.899) were not identified as statistically significant sources of the observed heterogeneity in the C-statistic of these models.
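The idea behind the meta-regression can be illustrated with a single study-level moderator. The sketch below fits a weighted least-squares line of the logit C-statistic on a binary moderator, with weights equal to the inverse of the within-study variance plus a between-study variance that is supplied rather than estimated. This is a simplification for illustration; the review's analysis was run with Stata's meta-regression commands, and the function name and all numbers here are hypothetical.

```python
import math


def weighted_meta_regression(effects, variances, moderator, tau2=0.0):
    """Weighted least-squares fit of effect = intercept + slope * moderator.

    Weights are 1 / (within-study variance + tau2); tau2 is taken as given.
    Returns (intercept, slope, standard error of the slope).
    """
    if not (len(effects) == len(variances) == len(moderator)) or len(effects) < 3:
        raise ValueError("need at least three studies with matching inputs")
    w = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, moderator)) / sum_w
    y_bar = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, moderator))
    if sxx == 0:
        raise ValueError("moderator has no variation across studies")
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, moderator, effects))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    se_slope = math.sqrt(1.0 / sxx)
    return intercept, slope, se_slope


# Hypothetical logit C-statistics, their variances, and a binary moderator
# (1 = number of risk factors above the median, 0 = below).
logit_c = [0.85, 0.95, 1.25, 1.45]
var_logit_c = [0.006, 0.004, 0.010, 0.010]
many_risk_factors = [0, 0, 1, 1]
intercept, slope, se_slope = weighted_meta_regression(
    logit_c, var_logit_c, many_risk_factors, tau2=0.02
)
print(f"slope = {slope:.3f} (SE {se_slope:.3f})")
```

With a binary moderator, the slope is the difference between the weighted mean logit C-statistics of the two groups, so a slope that is large relative to its standard error suggests the moderator explains part of the heterogeneity.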

Critical appraisal of traditional regression-based models
We assessed study quality using the PROBAST checklist. A detailed assessment of the risk of bias (ROB) and applicability is presented in S2 Table and Fig 4. Overall, ROB was "low" in 19 studies, "high" in 5 studies, and "unclear" in 10 studies. Overall applicability was "low concern" in 12 studies, "high concern" in 21 studies, and "unclear concern" in 1 study. Within the ROB domains, the "low" risk of bias was observed in most of the domains except the "analysis" domain, where a large portion of studies (more than 30%) was "unclear" (Fig 4). Similarly, within the applicability domains, the "participants" domain seems to be a concern, as a large portion of studies (more than 30%) were at "high concern" or "unclear concern" (Fig 4). We also presented the different PROBAST signaling questions' distribution of responses by the various studies in S1 and S2 Figs.

Study characteristics of machine learning-based models
Study characteristics of machine learning-based models are presented in Table 3. A total of 1,211,093 participants were used to develop 42 machine learning-based models in 20 studies.

PLOS ONE
Models were primarily developed either in white Caucasian or Asian populations. The number of risk factors/features considered to create the model ranged from 2 to 169, with a median of 7 risk factors per model. Age was the most common risk factor (considered in 25 models), followed by sex/gender (8 models), BMI (7 models), DBP (6 models), smoking (6 models), and parental history of hypertension (6 models). The distribution of the conventional risk factors considered in machine learning models is presented in Fig 2B. Hypertension was predominantly defined using SBP ≥ 140 mm Hg, DBP ≥ 90 mm Hg, or antihypertensive medication. Artificial neural network (ANN) was the most common method used to develop the models. Different studies reported different performance measures, and accuracy and AUC/C-statistic were the two most commonly reported measures. Most of the studies did not report calibration measures. In studies that reported discrimination, the AUC (or C-statistic) values ranged from 0.64 to 0.93.

Meta-analysis of machine learning-based models
The overall pooled C-statistic of the machine learning-based models was 0.76 [0.72-0.79], with high heterogeneity in the discriminative performance of these models (I² = 99.9%, Cochran Q-statistic p < 0.001) (Fig 5). Unlike for traditional regression-based models, we did not stratify the pooled results by modeling type because of the diversity of modeling methods. The 95% approximate prediction interval for the overall C-statistic was from 0.63 to 0.84 (Fig 5).
We explored possible sources of heterogeneity in the overall pooled C-statistic through meta-regression as before. However, meta-regression did not identify any of the following as a potential source of high heterogeneity in the C-statistic: age of the participants (p = 0.358), the number of risk factors considered in the model (p = 0.812), sex of the participants, that is, male-only compared to female-only (p = 0.886) and both male and female compared to female-only (p = 0.787), sample size considered in the model (p = 0.577), or ethnicity of the study participants (p = 0.326).

Study characteristics of externally validated models
Only four models [22,30,31,32] were found to be externally validated in a different population. Detailed characteristics of the studies that validated these four models are presented in S3 Table. The Framingham hypertension risk model (FHRS) is the only model validated in more than one external population. The FHRS [22] model was validated by eight different studies in diverse populations totaling 122,348 participants. Study participants had an age range of 18 to 84 years, with follow-up time (mean/median/total) from 1.6 years to 25 years. Almost all studies reported performance measures of the FHRS. The Hosmer-Lemeshow test was used to report calibration, while the C-statistic (or AUC) was used to report discrimination. The values of the reported C-statistic ranged from 0.54 to 0.84. Models by Lim et al. [30], Völzke et al. [31], and Kanegae et al. [32] were validated only once in an external population by the same authors. Among these three models, performance was best for the model by Kanegae et al. [32], with a C-statistic of 0.85 [0.76-0.91].

Meta-analysis of externally validated models
The pooled C-statistic of the FHRS [22] model was 0.75 [0.68-0.80] with high heterogeneity in the discriminative performance of this model (I² = 99.6%, Cochran Q-statistic p < 0.001) (S3 Fig). The 95% approximate prediction interval for the C-statistic in the FHRS [22] was from 0.47 to 0.91 (S3 Fig). As the other three models were externally validated only once, pooling their performance measure was irrelevant.

We explored possible sources of heterogeneity in the pooled C-statistic through meta-regression, and only the ethnicity (Whites versus Asians) of the study participants (p = 0.044) was identified as a source of high heterogeneity in the C-statistic of the FHRS model [22].

Models developed using genetic risk factors/biomarkers
Genetic risk factors/biomarkers often contribute significantly to developing hypertension, and models were developed considering both conventional risk factors and biomarkers. In addition, there were models where biomarkers were used primarily in model building. Information about models developed using biomarkers (e.g., genetic risk scores) is presented in S4 Table. There were 11 studies where genetic risk factors/biomarkers were used in model building. Biomarkers are often considered very important for increasing the predictive performance of models. However, the pooled predictive performance (C-statistic) of the models that considered biomarkers primarily was 0.76 [0.71-0.80] (S4 Fig) and did not show an overall improvement in the models' predictive performance. Including genetic factors/biomarkers in the model has some drawbacks. Because information on those biomarkers is frequently unavailable and interpreting the models becomes difficult, the models become less suitable for daily clinical practice.

Discussion
Many hypertension risk prediction models with reasonable predictive performance were identified in this systematic review, but only a few had external validation. Bias and applicability were noted as major concerns in many studies. Overall, there was little difference in the predictive performance of traditional statistical and machine learning models. Our findings are expanded on in the sections that follow.
The models were developed mostly in Caucasian or Asian populations. Because certain ethnic groups are more prone to hypertension (e.g., people of African descent [33]), research should include a diverse range of patients to create hypertension risk prediction models. Most of the traditionally developed models considered conventional risk factors for hypertension, which are readily available in clinical practice. Some models also used genetic risk factors, although the inclusion of genetic risk factors did not improve the overall predictive performance of the models. The pooled analysis showed that the overall predictive performance of the traditional regression-based models was good but with high heterogeneity. Stratified analysis by modeling methodology (e.g., logistic, Cox) within traditional regression-based models did not show much difference in predictive performance, and heterogeneity was still observed within each modeling methodology. The traditional models we identified in our search were mostly internally validated, which is often considered insufficient to establish a model's generalizability [34]. The FHRS [22] was the only model that had multiple external validations and good/acceptable pooled predictive performance. However, because the FHRS [22] showed high heterogeneity in its predictive performance, with ethnicity serving as a source of heterogeneity, and the model was built predominantly in a White population, we must proceed with caution when applying it to a completely different population. Models with only a single validation or none need external validation, preferably by a different group of investigators, to establish the model's generalizability to a different population. Only a few traditional models were converted into a risk score after their development.
Presenting the risk derived from the model through scoring instead of a complex mathematical formula may facilitate the use of prediction models and subsequently improve the uptake of prediction models in clinical practice. The risk of bias (ROB) was "high" or "unclear" in a large portion of traditional model studies. This is primarily because many studies failed to meet the criteria in the "analysis" domain of ROB. In many studies, the applicability of the models was rated as "high concern" or "unclear concern" due to a failure to properly fulfil the "participants" criteria. Several models were developed in a specific population, making the models less applicable to the general adult population.
Since machine learning tools are more recent and advanced and have a reputation for producing more accurate predictions, we assumed that models developed with these tools would outperform traditional regression-based models. However, we did not notice much difference in predictive performance between these two types of models. A few machine learning-based models (e.g., models by Huang et al. [35], Sakr et al. [36], and Ye et al. [37]) showed excellent discriminative performance; however, none of these models has been externally validated in an entirely different population. In fact, none of the machine learning-based models have been externally validated. Consequently, the performance of those models in a new setting/population is quite uncertain. We also noticed high heterogeneity in the predictive performance (C-statistic) of machine learning models. Meta-regression using potential sources of heterogeneity failed to identify the source of this heterogeneity. One possible explanation is a difference in the methodology used to develop the machine learning-based models. Due to the variety of methods used across models, we were unable to investigate this potential source. We did not notice higher expected variability in the future predictive performance of machine learning-based models compared to traditional regression-based models, as the 95% prediction interval for machine learning-based models was similar to that for traditional regression-based models.
We did not find any studies in this review that assessed the impact of adopting hypertension risk prediction models in clinical settings. Ideally, a prediction model, regardless of its development, should have an impact study to assess whether it improves clinical decision-making and patient health outcomes [5,38].
Two previous reviews on a similar topic identified hypertension risk prediction models through a systematic search and described their characteristics. Our review differs from these studies and adds to the evidence on the prediction of hypertension risk and the identification of associated risk factors in the following ways: 1) we synthesized the performance of the prediction models through meta-analysis and explored potential sources of heterogeneity; 2) we compared the performance of prediction models developed using traditional statistical regression-based methods and more recent machine learning-based methods; 3) we provided a thorough evaluation of the quality of the studies among traditionally developed regression-based models; and 4) we described several additional models that have recently been derived.
One of our study's strengths is the extent of the systematic search, which includes four different databases, grey literature, and extensive use of the reference lists of the identified studies. To the best of our knowledge, this is the first study of hypertension risk prediction models to combine a meta-analysis of predictive performance, an assessment of heterogeneity, a comparison of the predictive performance of traditional regression-based models and machine learning-based models, and a detailed critical appraisal of studies. Nevertheless, our study also has limitations. We excluded non-English and non-French publications. While English is widely perceived as the primary language of science, restricting results to particular languages can introduce language bias and may lead to incorrect conclusions [39]. We were only able to use C-statistics to compare model performance, and the C-statistic can be insensitive to a model's ability to correctly stratify patients into clinically relevant risk groups [39,40]. Calibration was quantified by different measures, and different studies often reported different calibration measures. This made it difficult to synthesize calibration measures through meta-analysis. A meta-analysis of calibration measures (e.g., O/E ratio) along with C-statistics could provide a comprehensive summary of the performance of these models [19]. Not assessing publication bias amongst the studies is another potential limitation of this study. Recent guidelines [19] did not emphasize the need to assess publication bias for prediction model performance, which encouraged us not to do so.
Although studies have considered publication bias in a similar scenario before, we believe existing traditional publication bias assessment tools (e.g., funnel plot, Egger's test, Begg's test) are more appropriate for studies assessing statistically significant results (e.g., randomized controlled trials (RCTs)) than for studies assessing the predictive performance (e.g., C-statistic) of prognostic models. Instead, we assessed ROB using the PROBAST checklist. We also could not appraise studies that use machine learning algorithms to predict hypertension. Although most of the PROBAST signaling questions also apply to machine learning algorithms, adding further signaling questions is recommended because data analysis methods differ between machine learning algorithms and regression-based models [14,15]. Machine learning algorithms use different variable selection strategies, different estimation techniques for variable-outcome associations, and different ways to adjust for overfitting [14,15]. When additional questions are added to the PROBAST, these questions need to be appropriately phrased, and specific guidance on assessing these signaling questions also needs to be provided [14,15]. Considering this additional work, we refrained from appraising studies that used machine learning algorithms. Finally, despite our attempt to capture potential sources of heterogeneity in our study, we ask readers to be cautious while interpreting our findings, as there may be bias due to the limited number of studies included in the analysis and our inability to incorporate additional potential sources of bias in the analysis.
In summary, we attempted to provide a comprehensive evaluation of hypertension risk prediction models. We identified many models with acceptable-to-good predictive performance. We did not notice significant differences in the predictive performance of traditional regression-based models and machine learning-based models. Including genetic risk factors/biomarkers also did not show much improvement in the models' predictive performance. The quality of the studies was reasonable, with areas where further improvement is needed. Only a few of the many models developed had been externally validated, which is a concern. Also, there is a lack of impact studies. Models with external validation and impact studies are required to implement a prediction model in a clinical practice guideline. A model with accurate prediction is not beneficial if it is not generalizable to a different population or does not improve clinical decision-making and patient health outcomes.